A Bigram Extension to Word Vector Representation
Authors
Abstract
GloVe is an algorithm that associates a vector with each word such that the dot product of two word vectors corresponds to the likelihood that the words appear together in a large corpus ([PSM14]). GloVe vectors achieve state-of-the-art performance on word analogy tasks (v(king) − v(man) + v(woman) ≈ v(queen)), but they are limited to capturing the meanings of individual words. In our project, we develop “biGloVe,” a version of GloVe that learns vector representations of bigrams. Using the full English Wikipedia text as our training corpus, we compute 1.2 million bigram vectors in 150 dimensions. To evaluate the quality of our biGloVe vectors, we apply them to two machine learning tasks. The first task is a 2012 SemEval challenge in which one must determine the semantic similarity of two sentences or phrases. We trained a logistic regression model using the cosine similarity of the averaged sentence (bi)GloVe vectors as features, and found slightly better performance on one challenge when GloVe and biGloVe were combined; in general, however, biGloVe vectors did not improve performance. Second, we applied biGloVe vectors to classifying the sentiment of movie reviews, training naive Bayes (with bag-of-words features), SVM, and random forest classifiers. We found that naive Bayes or an SVM with GloVe vectors performed best. Applications of biGloVe vectors were hindered by insufficient bigram coverage, despite training 1.2 million vectors. At the same time, examination of nearest neighbors revealed that biGloVe vectors do capture semantic relationships unique to bigrams, suggesting that the method has promise. Training new vectors on a much larger corpus such as Common Crawl is likely to improve the performance of biGloVe vectors on these tasks.
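A minimal sketch of the sentence-similarity feature described above: each sentence is represented by the average of its (bi)GloVe vectors, and the cosine similarity of the two averages is fed to the classifier. The `glove` lookup table, the dimensionality handling, and the helper names are illustrative assumptions, not the paper's published pipeline.

```python
import numpy as np

def sentence_vector(tokens, glove, dim=150):
    """Average the vectors of all in-vocabulary tokens (zeros if none)."""
    vecs = [glove[t] for t in tokens if t in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine_feature(sent_a, sent_b, glove, dim=150):
    """Cosine similarity of the two averaged sentence vectors."""
    u = sentence_vector(sent_a, glove, dim)
    v = sentence_vector(sent_b, glove, dim)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

# Hypothetical usage: one such feature per vector model, stacked as the
# input to sklearn.linear_model.LogisticRegression:
# X = [[cosine_feature(a, b, glove),
#       cosine_feature(bigrams(a), bigrams(b), biglove)]
#      for a, b in sentence_pairs]
```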
Similar resources
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcomings in capturing the semantic concepts of text motivated researchers to use...
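For context, a minimal sketch of the TF-IDF document vectors this snippet refers to, using scikit-learn's TfidfVectorizer; the toy corpus is invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse (n_docs, n_terms) matrix

print(vectorizer.get_feature_names_out())
print(X.toarray())  # one TF-IDF feature vector per document
```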
Word Pairs in Language Modeling for Information Retrieval
Previous language modeling approaches to information retrieval have focused primarily on single terms. The use of bigram models has been studied, but the restriction on word order and adjacency may not be justified for information retrieval. We propose a new language modeling approach to information retrieval that incorporates lexical affinities, or pairs of words that occur near each other, wi...
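A minimal sketch of extracting such lexical affinities, assuming "near each other" means unordered pairs co-occurring within a small window; the window size and whitespace tokenization are assumptions for illustration.

```python
from collections import Counter

def lexical_affinities(tokens, window=5):
    """Count unordered token pairs co-occurring within `window` positions,
    relaxing the order and adjacency restrictions of ordinary bigrams."""
    pairs = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            pairs[tuple(sorted((w, v)))] += 1
    return pairs

text = "the quick brown fox jumps over the lazy dog"
print(lexical_affinities(text.split()).most_common(3))
```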
Interpolated Distanced Bigram Language Models for Robust Word Clustering
Two methods for interpolating the distanced bigram language model are examined which take into account pairs of words that appear at varying distances within a context. The language models under study yield a lower perplexity than the baseline bigram model. A word clustering algorithm based on mutual information with robust estimates of the mean vector and the covariance matrix is employed in t...
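A minimal sketch of interpolating distanced bigram models, assuming per-distance conditional estimates P(w_i | w_{i−d}) mixed with fixed weights; the add-one smoothing and the weight handling are illustrative assumptions, not the paper's estimator.

```python
from collections import Counter

def train_distanced_bigrams(tokens, max_distance=3):
    """Count (history, word) pairs separately for each distance d."""
    counts = {d: Counter() for d in range(1, max_distance + 1)}
    history = {d: Counter() for d in range(1, max_distance + 1)}
    for i, w in enumerate(tokens):
        for d in range(1, max_distance + 1):
            if i - d >= 0:
                h = tokens[i - d]
                counts[d][(h, w)] += 1
                history[d][h] += 1
    return counts, history

def interpolated_prob(w, context, counts, history, weights, vocab_size):
    """Mix add-one-smoothed P(w | w_{i-d}) over distances d = 1..len(weights)."""
    p = 0.0
    for d, lam in enumerate(weights, start=1):
        if d <= len(context):
            h = context[-d]
            p += lam * (counts[d][(h, w)] + 1) / (history[d][h] + vocab_size)
    return p
```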
New language models using phrase structures extracted from parse trees
This paper proposes a new speech recognition scheme using three linguistic constraints. Multi-class composite bigram models [1] are used in the first and second passes to reflect word-neighboring characteristics as an extension of conventional word n-gram models. Trigram models with constituent boundary markers and word pattern models are both used in the third pass to utilize phrasal constrain...
Semantic Composition and Decomposition: From Recognition to Generation
Semantic composition is the task of understanding the meaning of text by composing the meanings of the individual words in the text. Semantic decomposition is the task of understanding the meaning of an individual word by decomposing it into various aspects (factors, constituents, components) that are latent in the meaning of the word. We take a distributional approach to semantics, in which a ...
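A minimal sketch of the simplest distributional composition operator, additive composition, assuming a word-vector lookup table; the paper itself studies richer composition and decomposition operators.

```python
import numpy as np

def compose(words, vectors):
    """Additive composition: unit-normalized sum of the word vectors."""
    v = np.sum([vectors[w] for w in words], axis=0)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# e.g. compose(["red", "car"], vectors) should lie near the vectors of
# semantically related words such as "automobile".
```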